Reliability F.Heidary
Reliability is an essential characteristic of a good test. If a test does not measure consistently (reliably), the scores from a particular administration cannot be counted on as an accurate index of students’ achievement. We cannot trust test scores unless we know how consistently the test measures; only to the extent that scores are reliable can they be useful and fair to students. Technically, reliability is the extent to which test scores are free from errors of measurement. No classroom test is perfectly reliable, because random errors cause scores to vary from time to time and from situation to situation. The goal is to minimize these inevitable errors of measurement and thus increase reliability. Several factors introduce error into measurement.
'''(1) Item Sampling:''' Because any test is only a sample of all possible items, the item sample itself can be a source of error. Longer tests are typically more reliable because they provide a better sample of the course content and of students’ performance. A one-question test would not provide a reliable estimate of students’ knowledge, but as more questions are added, the sample fits the unit of instruction better and yields scores that more accurately reflect real differences in achievement. Increasing the length of the test (the size of the sample) therefore increases the consistency of measurement. A longer test also tends to reduce the influence of chance factors such as guessing. There is a caveat: lengthening a test improves reliability only when the additional items are of good quality and as reliable as the original ones. Adding poor-quality items will actually introduce error and lower reliability. Furthermore, there is a point of diminishing returns: if we add too many items, we risk student fatigue, which will lower reliability.
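The effect of test length described above is often approximated with the Spearman-Brown prophecy formula. The text does not name this formula, so it is offered here only as an illustration. The sketch below assumes the added items are comparable in quality to the original ones, which matches the caveat above, and it does not model student fatigue. The function name `spearman_brown` and the example values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after changing test length by length_factor.

    Assumes the added items are as good as the original ones
    (the caveat in the text); fatigue effects are not modeled.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting point: a short quiz with reliability 0.60.
original = 0.60
for factor in (1, 2, 3, 4):
    predicted = spearman_brown(original, factor)
    print(f"{factor}x length -> predicted reliability {predicted:.2f}")
```

Under these assumptions, doubling the quiz raises the predicted reliability from 0.60 to 0.75, and each further increase in length adds less than the one before.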
'''(2) Construction of the Items:''' Another major threat to reliable measurement is poorly worded, ambiguous, or trick questions. Test questions that permit widely varying interpretations of what is expected are not likely to yield highly reliable scores.
'''(3) Test Administration:''' Environmental factors such as heat, light, and noise, along with confusing directions and different amounts of testing time allowed to different students, can affect students’ scores. The more such factors interfere with a student’s performance, the less faith we can have in the accuracy of the test scores.
'''(4) Scoring:''' Objectivity, the extent to which equally competent scorers obtain the same score, is a factor affecting reliability. An objective test is more reliable because the test scores reflect true differences in achievement among students and not the judgment and opinions of the scorer. Typically, essay tests have lower reliability than multiple-choice tests because subjectivity in scoring lowers reliability. This does not mean, however, that instructors should not use essay tests; there are things we can do to improve their reliability.
'''(5) Difficulty of the Test:''' A test that is either too easy or too difficult for the class taking it will typically have low reliability. This occurs because the scores will be clustered together at either the high end or the low end of the scale, with small differences among students. Reliability is higher when the scores are spread out over the entire scale, showing real differences among students.
'''(6) Student Factors:''' Student fatigue, illness, or anxiety can introduce error and lower reliability because they affect performance and keep a test from measuring students’ true ability or achievement.
Types of Reliability
'''Test-Retest Reliability:''' To estimate test-retest reliability, you must administer a test form to a single group of examinees on two separate occasions.
Typically, the two administrations are only a few days or a few weeks apart; the interval should be short enough that the examinees’ skills in the area being assessed have not changed through additional learning. The relationship between the examinees’ scores on the two administrations is estimated through statistical correlation to determine how similar the scores are. This type of reliability shows the extent to which a test produces stable, consistent scores across time.
'''Parallel Forms Reliability:''' Many exam programs develop multiple, parallel forms of an exam to help provide test security. These forms are all constructed to match the test blueprint and to be similar in average item difficulty. Parallel forms reliability is estimated by administering both forms of the exam to the same group of examinees. While the time between the two administrations should be short, it needs to be long enough that examinees’ scores are not affected by fatigue. The examinees’ scores on the two forms are correlated to determine how similarly the two forms function. This reliability estimate is a measure of how consistent examinees’ scores can be expected to be across test forms.
'''Decision Consistency:''' In the descriptions of test-retest and parallel forms reliability above, the consistency or dependability of the ''test scores'' was emphasized. For many criterion-referenced tests (CRTs), a more useful way to think about reliability may be in terms of examinees’ classifications. For example, a typical CRT will result in an examinee being classified as either a master or a non-master; the examinee will either pass or fail the test. It is the reliability of this classification decision that decision consistency estimates.
If an examinee is classified as a master on both test administrations, or as a non-master on both, the test is producing consistent decisions. This approach can be used either with parallel forms or with a single form administered twice in test-retest fashion.
'''Internal Consistency:''' The internal consistency measure of reliability is frequently used for norm-referenced tests (NRTs). This method has the advantage that it can be carried out with a single form given at a single administration. The internal consistency method estimates how well the items on a test correlate with one another; that is, how similar the items on a test form are to one another. Many test analysis software programs produce this reliability estimate automatically. However, two common differences between NRTs and CRTs make this method of reliability estimation less useful for CRTs. First, because CRTs are typically designed to have a much narrower range of item difficulty and of examinee scores, the value of the reliability estimate will tend to be lower. Second, CRTs are often designed to measure a broader range of content, which results in a set of items that are not necessarily closely related to each other. This aspect of CRT design will also produce a lower reliability estimate than would be seen on a typical NRT.
'''Inter-rater Reliability:''' All of the methods for estimating reliability discussed thus far are intended for objective tests. When a test includes performance tasks or other items that must be scored by human raters, the reliability of those raters must be estimated. This method asks the question, "If multiple raters scored a single examinee’s performance, would the examinee receive the same score?" Inter-rater reliability provides a measure of the dependability or consistency of scores that might be expected across raters.
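The estimates described under Types of Reliability can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure. The text only says "statistical correlation", so the use of the Pearson coefficient is an assumption. Cronbach's alpha is one common internal consistency coefficient, and Cohen's kappa is one common chance-corrected agreement index; neither is named in the text. All function names and data are hypothetical.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Correlation between two score lists (test-retest or parallel forms)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def decision_consistency(first, second):
    """Proportion of examinees given the same classification both times."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Agreement between two lists of classifications, corrected for chance."""
    n = len(first)
    observed = decision_consistency(first, second)
    categories = set(first) | set(second)
    expected = sum((first.count(c) / n) * (second.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


def cronbach_alpha(item_scores):
    """Internal consistency; item_scores[i][j] is examinee i's score on item j."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores for five examinees on two administrations of one form.
first_admin = [72, 85, 60, 90, 78]
second_admin = [70, 88, 65, 87, 80]
print("test-retest r:", round(pearson_r(first_admin, second_admin), 2))

# Made-up pass/fail decisions for four examinees on two occasions.
decisions_1 = ["pass", "pass", "fail", "fail"]
decisions_2 = ["pass", "fail", "fail", "fail"]
print("decision consistency:", decision_consistency(decisions_1, decisions_2))
print("kappa:", cohens_kappa(decisions_1, decisions_2))

# Made-up right/wrong (1/0) item scores: four examinees, three items.
items = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print("alpha:", cronbach_alpha(items))
```

With the made-up decisions above, three of four examinees are classified the same way both times (0.75), while kappa is 0.50 because part of that agreement would be expected by chance. The same `cohens_kappa` function can be applied to two raters' categorical ratings of the same performances as a rough inter-rater check.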